Tissue of origin prediction for cancer of unknown primary using a targeted methylation sequencing panel

Rationale: Cancer of unknown primary (CUP) is a group of rare malignancies with poor prognosis and unidentifiable tissue of origin. Distinct DNA methylation patterns in different tissues and cancer types enable the identification of the tissue of origin in CUP patients, which could help risk assessment and guide site-directed therapy.
Methods: Using genome-wide DNA methylation profile datasets from The Cancer Genome Atlas (TCGA) and machine learning methods, we developed a 200-CpG methylation feature classifier for CUP tissue of origin prediction (MFCUP). MFCUP was further validated with publicly available methylation array data of 2977 specimens and targeted methylation sequencing of 78 formalin-fixed paraffin-embedded (FFPE) samples from a single center.
Results: MFCUP achieved an accuracy of 97.2% in a validation cohort (n = 5923) representing 25 cancer types. When applied to an Infinium 450 K array dataset (n = 1052) and an Infinium EPIC (850 K) array dataset (n = 1925), MFCUP achieved an overall accuracy of 93.4% and 84.8%, respectively. Based on MFCUP, we established a targeted bisulfite sequencing panel and validated it with FFPE sections from 78 patients with 20 cancer types. This methylation sequencing panel correctly identified the tissue of origin in 88.5% (69/78) of samples. We also found that the methylation levels of specific CpGs can distinguish one cancer type from others, indicating their potential as biomarkers for cancer diagnosis and screening.
Conclusion: Our methylation-based cancer classifier and targeted methylation sequencing panel can predict the tissue of origin in diverse cancer types with high accuracy.
Supplementary Information: The online version contains supplementary material available at 10.1186/s13148-024-01638-6.


Introduction
Cancer of unknown primary (CUP), accounting for about 2% of all cancer diagnoses, is a heterogeneous group of metastatic malignancies without identifiable primary tumor sites. CUP can be categorized into favorable and unfavorable subsets [1,2]. Through a standard diagnostic workup, 15-20% of patients with CUP can be assigned to a putative primary tumor site [3]. Patients in these favorable subsets typically receive site-specific therapies and have favorable outcomes. The favorable-CUP subsets encompass head and neck squamous cell carcinoma, breast, ovarian, prostate, kidney, and colorectal cancer [1]. The remaining patients with CUP (80-85%) fall into the unfavorable subset and receive empiric chemotherapies [3]. The favorable and unfavorable CUP subsets have median overall survival (OS) of 11.7 months and 3.9 months, respectively [2]. The 1-year survival rates in these two subsets were 45% and 11%, respectively [2].
The initial evaluation for CUP includes a thorough physical examination, basic blood tests, CT/MRI scans, endoscopies, and microsatellite instability (MSI)/mismatch repair deficiency (dMMR) testing [3]. The major CUP histologies include well to moderately differentiated adenocarcinomas (~50%), poorly differentiated or undifferentiated adenocarcinomas (~30%), squamous-cell carcinomas (~15%), and undifferentiated neoplasms (~5%) [3]. Although a routine histopathological workup can determine the most likely cell lineages of CUP, it cannot define the primary tumor site for most CUP cases [4]. Identifying the tissue of origin in patients with unfavorable CUP can reassign them to the favorable-CUP subsets and enable the application of site-specific therapies [3].
Epigenetic modifications, including DNA methylation, play an important role in the regulation of tissue-specific gene expression and cellular identity [5]. Different tissues and cancer types show distinct DNA methylation patterns, which makes DNA methylation a promising tool for cancer classification. The TCGA project has generated genome-wide DNA methylation profiles of 10,814 tumor samples in 33 cancer types [6]. This extensive methylation dataset enables the development of cancer classifiers, which can be used for CUP diagnosis [7].
DNA methylation profiling has been used in the classification of sarcoma, central nervous system (CNS) tumors, and sinonasal tumors [8-10]. Methylation classifiers also showed promising results in tissue of origin prediction among patients with CUP or head and neck squamous cell cancers with unknown primary (HNSCC-CUPs) [11,12]. The primary goal of this study is to develop an affordable and accessible targeted methylation next-generation sequencing panel for CUP diagnosis. Furthermore, we discovered candidate CpGs whose methylation status can distinguish one cancer type from others.

Feature selection and classifier development
Whole-genome Illumina Infinium HumanMethylation450 (450 K) BeadChip array data across 22 cancer types and adjacent normal tissues were obtained from The Cancer Genome Atlas (TCGA) NCI GDC Data Portal (https://portal.gdc.cancer.gov) (Additional file 1: Table S1). Since the TCGA ovarian cancer methylation dataset was based on the low-coverage Infinium HumanMethylation27 (27 K) array, we replaced it with an ovarian cancer 450 K array methylation dataset (GSE102119) [13].
For feature selection, we employed the Random Forest (RF) algorithm, which was used in the EPICUP CUP classifier and the DKFZ CNS tumor classifier [8,12]. The combined methylation datasets of 23 cancer types were randomly split into a training set (30%) and a validation set (70%) (Fig. 1A, Additional file 1: Table S1). For every CpG site, a one-way analysis of variance (ANOVA) was performed to compare methylation levels (β values) among different cancer types. Tukey's honest significant difference post hoc test was applied to features with a significant difference. CpGs that were differentially methylated in at least one cancer type were selected (Δβ > 0.2, p < 0.01). An RF classification algorithm was then trained in two consecutive steps: (1) the selected CpGs were employed to build a prediction model using the RF machine learning method (R package randomForest version 4.7-1.1), and the variable importance of each CpG site was calculated by the mean decrease in accuracy; (2) CpGs that reduced the out-of-bag (OOB) error were added in order of descending variable importance. We used default values of the RF parameters: ntree = 500, nodesize = 1, mtry = sqrt(p), where p is the number of features. After five runs of the two-step procedure, a total of 744 CpGs were obtained as the union of the 200 CpGs with the highest variable importance from each run. Next, we evaluated the tissue of origin prediction performance of the top 50, 100, 150, 200, 250, and 300 features on the validation set. We found that the top 50 features had the lowest accuracy (~96%), while the others had similar results (~98%). To allow for methylation signal loss during capture probe synthesis and targeted bisulfite sequencing, we chose 200 as the number of features for classifier development and targeted methylation sequencing panel design. We retrained the RF model with the 744 CpGs and ranked them by variable importance. The top 200 CpGs with the highest variable importance were selected as the final methylation feature set.
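The filtering and union steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' R code: the function names are hypothetical, Δβ is assumed to mean the largest gap between one cancer type's mean β value and the mean of all other samples, and the p-value lookup, the Tukey post hoc test, and the Random Forest itself are omitted.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic for one CpG.

    groups: list of lists, one list of beta values per cancer type.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def max_delta_beta(groups):
    """Largest |mean of one type - mean of all other samples|.

    Assumed definition of delta-beta; the source does not spell it out.
    """
    best = 0.0
    for i, g in enumerate(groups):
        rest = [x for j, h in enumerate(groups) if j != i for x in h]
        best = max(best, abs(mean(g) - mean(rest)))
    return best


def union_of_top_k(importance_runs, k):
    """Union of the k most important CpGs from each run.

    importance_runs: list of dicts mapping CpG id -> variable importance.
    """
    selected = set()
    for run in importance_runs:
        ranked = sorted(run, key=run.get, reverse=True)
        selected.update(ranked[:k])
    return selected
```

With k = 200 and five runs, `union_of_top_k` can return anywhere between 200 and 1000 CpGs depending on how much the runs overlap, which is consistent with the 744 reported in the text.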
For classifier development, we used 450 K methylation array datasets from 32 cancer types (31 from TCGA) (Additional file 1: Table S2). Based on the similarity of DNA methylomes and/or tissues of origin, we made the following adjustments: uterine carcinosarcoma (UCS) and uterine corpus endometrial carcinoma (UCEC) were grouped as the uterine cancer (UC) cohort (n = 368); colon and rectum adenocarcinoma (COAD/READ) were grouped as the colorectal cancer (CRC) cohort (n = 283); acute myeloid leukemia (LAML) and diffuse large B-cell lymphoma (DLBC) were grouped as the hematolymphoid malignancies (HLM) cohort (n = 134); esophageal and stomach adenocarcinoma (EAC/STAD) were grouped as the upper gastrointestinal tract adenocarcinoma cohort

Methylation calling
Adapters, low-quality ends, and any sequencing reads shorter than 50 bp were removed with trim_galore (version 0.6.2). The reads were then mapped to the in silico C-to-T converted human RefSeq genome hg19 using Bismark (version 0.20.0). Duplicates were removed with the deduplicate_bismark module in Bismark. The methylation ratio for each CpG was calculated with the bismark_methylation_extractor script in Bismark.
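As an illustration of the last step, the sketch below turns Bismark coverage records into per-CpG methylation ratios. It assumes the tab-separated Bismark coverage layout (chromosome, start, end, methylation percentage, methylated count, unmethylated count). The function name and the `min_depth` filter are hypothetical and are not taken from the study.

```python
def methylation_ratios(cov_lines, min_depth=10):
    """Compute methylated / (methylated + unmethylated) per CpG position.

    cov_lines: iterable of tab-separated Bismark coverage records.
    min_depth: hypothetical minimum read depth; positions below it are skipped.
    Returns a dict mapping (chromosome, start) -> ratio in [0, 1].
    """
    ratios = {}
    for line in cov_lines:
        if not line.strip():
            continue  # ignore blank lines
        chrom, start, _end, _pct, meth, unmeth = line.rstrip("\n").split("\t")
        meth, unmeth = int(meth), int(unmeth)
        depth = meth + unmeth
        if depth < min_depth:
            continue
        ratios[(chrom, int(start))] = meth / depth
    return ratios
```

The ratio is on the same 0 to 1 scale as an array β value, which is what allows sequencing-derived values to be fed to a classifier trained on array data; any platform-specific calibration is outside the scope of this sketch.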

Methylation feature selection
Genome-wide Infinium 450 K DNA methylation array data of 7,385 tumor samples of known origin were obtained from TCGA (22 cancer types, n = 7294) and GSE102119 (ovarian cancer, n = 91) [6,13]. Tumor samples were randomly assigned to the training (30%) and validation (70%) sets. As described in the Methods, we chose the RF algorithm for feature selection and 200 as the number of features. The top 200 CpGs with the highest variable importance were selected as the final feature set for classifier development and targeted methylation sequencing panel design (Fig. 1A). A t-distributed stochastic neighbor embedding (t-SNE) dimensionality reduction plot showed the partition of different methylation classes representing 23 cancer types (Fig. 1B).
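A 30%/70% split of this kind can be sketched as follows. The text only states that samples were randomly assigned; splitting within each cancer type (stratification), the seed, and the function name are assumptions made here so that rare cancer types appear in both sets.

```python
import random
from collections import defaultdict


def split_train_validation(labels, train_fraction=0.3, seed=0):
    """Randomly split sample ids into training and validation sets.

    labels: dict mapping sample id -> cancer type.
    The split is done separately within each cancer type (an assumption).
    Returns (train_ids, validation_ids).
    """
    by_type = defaultdict(list)
    for sample_id, cancer_type in labels.items():
        by_type[cancer_type].append(sample_id)

    rng = random.Random(seed)
    train, validation = [], []
    for cancer_type in sorted(by_type):
        ids = sorted(by_type[cancer_type])  # fixed order before shuffling
        rng.shuffle(ids)
        n_train = round(train_fraction * len(ids))
        train.extend(ids[:n_train])
        validation.extend(ids[n_train:])
    return train, validation
```

Sorting before shuffling keeps the result reproducible for a given seed regardless of the order in which samples were loaded.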
Analysis of the 200 targeted CpG sites revealed that 48% are located in CpG islands, 21% in CpG shore/shelf regions, and 31% in other regions of the genome without any enrichment of CpG content (open sea) (Fig. 2A). Upon inspection, these 200 CpGs are enriched in gene body regions, evenly distributed across promoter, 5'UTR and intergenic regions, and underrepresented in the 3'UTR (Fig. 2B). The 200 CpGs are distributed among all autosomes except chr 18 (Fig. 2C). As shown in Fig. 2D, promoter probes are most enriched in CGIs, and less enriched in CpG shelves and open sea.
Clustering analysis of the training DNA methylation dataset revealed that hypomethylated CpGs are enriched in CpG islands, whereas hypermethylated CpGs are enriched in CpG shelves/shores and open sea. Tumors originating from the same tissue or organ tended to cluster (Fig. 3). These included melanoma of the skin and eye (SKCM/UVM), and two lung cancers (LUAD and LUSC). The gastrointestinal carcinomas (COAD, LIHC, PAAD, and STAD) grouped together. Two adrenal gland tumors (PCPG and ACC) also grouped closely with the combined kidney tumors (KIDNEY). Two squamous cell carcinomas (LUSC and CESC) also clustered closely (Fig. 3).

Classifier development with the elastic net algorithm
For methylation classifier development, we employed 31 out of 33 available TCGA methylation datasets. The original TCGA esophageal carcinoma (ESCA) study recommended treating esophageal adenocarcinoma (EAC) and squamous cell carcinoma (ESCC) as two entities [24]. Consistent with this, the TCGA pan-cancer cell-of-origin study revealed that EAC clustered tightly with stomach adenocarcinoma (STAD), while head and neck squamous cell carcinoma (HNSC) and ESCC formed a Pan-Squamous cluster [6]. Based on the latter work, we combined two colorectal cancers (COAD and READ), two uterine cancers (UCS and UCEC), two upper gastrointestinal tract cancers (EAC and STAD), two squamous cell carcinoma datasets (ESCC and HNSC), and two hematolymphoid malignancies (LAML and DLBC) in downstream analysis.
The expanded TCGA/GSE dataset was randomly split into the training set (30%) and validation set (70%) (Additional file 1: Table S2). Based on the 200-CpG probe set, we developed three different classifiers with an RF, a Lasso, and an elastic net (EN) model on the training set. As EN outperformed the other two models on the validation set, it was selected as the final algorithm for classifier development. The EN-based classifier MFCUP predicted the tissue of origin with an overall accuracy of 97.2% in the validation set (Fig. 5A). The sensitivity, specificity, positive and negative predictive values (PPVs and NPVs) for each of the 25 cancer types are shown in Fig. 5B. MFCUP achieved a prediction accuracy of 100% for CRC, GLIOMA, PRAD, TGCT and THCA (Fig. 5A-C). Methylation classes representing different cancer types in the validation set also separated well in the t-distributed stochastic neighbor embedding (t-SNE) dimensionality reduction diagram (Additional file 1: Figure S2).
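The per-cancer-type metrics reported above follow from a confusion matrix by treating each cancer type in turn as the positive class (one-vs-rest). The sketch below shows that arithmetic, assuming rows hold the true histology and columns the predicted histology, as in the figure legends. The function names are hypothetical, and the counts in the usage note are toy values, not study data.

```python
def overall_accuracy(confusion):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def one_vs_rest_metrics(confusion, classes):
    """Sensitivity, specificity, PPV and NPV for each class.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    total = sum(sum(row) for row in confusion)
    metrics = {}
    for k, name in enumerate(classes):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                  # true k, predicted other
        fp = sum(row[k] for row in confusion) - tp   # true other, predicted k
        tn = total - tp - fn - fp
        metrics[name] = {
            "sensitivity": ratio(tp, tp + fn),
            "specificity": ratio(tn, tn + fp),
            "ppv": ratio(tp, tp + fp),
            "npv": ratio(tn, tn + fn),
        }
    return metrics
```

For a toy two-class matrix `[[8, 2], [1, 9]]` with classes `["LUAD", "LUSC"]`, the LUAD sensitivity is 8/10 and its specificity is 9/10. Because most samples are negatives for any single cancer type in a 25-class problem, specificity and NPV tend to be high even when sensitivity for a rare type is modest.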

MFCUP-based methylation sequencing panel
The major obstacles for methylation-based CUP diagnosis are the high cost and the lack of DNA methylation array facilities in most hospitals. To overcome these challenges, we developed a targeted methylation sequencing panel based on the 200-CpG set of MFCUP. We evaluated the performance of this panel with 78 FFPE samples from 20 cancer types, in which it achieved a classification accuracy of 88.5% (69/78) (Fig. 8).

Discussion
Recent studies have shown that DNA methylation profiling can be a valuable aid for accurate diagnosis of cancers of nervous tissue and muscular tissue [7]. For example, central nervous system (CNS) cancers are a heterogeneous group of tumors comprising around 100 entities, which makes accurate diagnosis of CNS tumors difficult.
The German Cancer Research Center (DKFZ) developed a clinical-grade CNS tumor classifier, which assigned a distinctive methylation signature to nearly all CNS tumor types [8,9]. This classifier was trained with 2,801 tumor samples comprising 91 methylation classes, and resulted in a change of diagnosis in 12% of prospective CNS tumor cases [8]. Based on this work, DNA methylation-based tumor classification is now included in the World Health Organization (WHO) classification of adult and pediatric CNS tumors [41,42]. Sarcomas are a heterogeneous group of solid tumors of mesenchymal origin, which are difficult to diagnose due to the lack of defining histopathological features in some subtypes. The DKFZ group also developed a methylation-based sarcoma classifier, which achieved a prediction accuracy of 75% in the validation sarcoma cohort (n = 428) [17]. In another validation study, the DKFZ sarcoma classifier was in accordance with the pathologic diagnosis in 88% of cases [43]. These results suggest that DNA methylation profiling may provide greater diagnostic precision than standard protocols.
To extend methylation-based cancer classification beyond a single tissue of origin, several groups have developed multi-cancer classifiers using large methylation datasets and machine learning, but challenges remain [6,44-46]. In a landmark study, Moran et al. established a DNA methylation-based CUP classifier, which can guide site-specific therapies for patients with CUP [12]. Using unsupervised clustering of the methylation profiles of 3,139 cancer-hypermethylated CpGs, Hoadley et al. divided 10,814 tumor samples from the TCGA dataset into 25 methylation groups [6]. Tang et al. [44] and Liu et al. [45] developed multi-cancer classifiers for tumor tissue and circulating free DNA, respectively. However, these two classifiers target 5457 and 9223 CpGs, respectively, which is impractical in many clinical settings. Danilova et al. [46] developed a 305-CpG cancer classifier with a discovery set consisting of five core cancer types. However, its prediction accuracy decreased significantly when it was applied to other cancer types. A cost-effective methylation sequencing panel, including dozens to hundreds of highly informative and discriminative CpG markers, is still lacking in the clinical practice of CUP diagnosis. Our aim was to develop an accessible and affordable DNA methylation-based CUP diagnosis assay independent of the high-throughput methylation array platform. Further studies are needed to evaluate the performance of our targeted methylation sequencing panel on metastases.
Human organs are highly complex and composed of multiple tissue and cell types. Genome-wide methylation profiling studies have revealed distinct methylation patterns in different human tissue and cell types [47,48]. Tissue-specific DNA methylation patterns provide a useful tool for the characterization of tissue of origin [47,49]. Similarly, cell-type-specific DNA methylation profiles enable cell type deconvolution in tissue samples [47,49]. Both tissue-specific and cancer-specific DNA methylation patterns appear to be maintained during cancer evolution [7]. A DNA methylation atlas based on deep whole-genome bisulfite sequencing of 39 normal human cell types demonstrated that almost all (97%) cell-type-specific differentially methylated regions (DMRs) are demethylated in one cell type but methylated in other cell types [47]. The authors suggested that this atlas can be used to identify the tissue of origin of cfDNA in the plasma of cancer patients. Of these cell-type-specific DMRs, 14% are covered by the Infinium 450 K array [9]. Interestingly, three CpGs in our methylation classifier are located in cell-type-specific unmethylated regions described in the normal human cell methylome study [47], including the breast luminal epithelium cell marker cg17403702 (ARFIP2), the kidney epithelial cell marker cg10572670 (ARHGEF28), and the lung alveolar epithelial cell marker cg00794055 (TBC1D24). Consistent with this, these CpGs are hypomethylated in one cancer type and the corresponding normal tissue, but hypermethylated in other cancer types. Moreover, our data showed that the lung alveolar epithelial cell DNA methylation biomarker cg00794055 (TBC1D24) is hypomethylated in LUAD and the corresponding normal control (LUAD), but hypermethylated in LUSC. Deconvolution of the TCGA LUAD/LUSC 450 K DNA methylation array datasets revealed that the relative proportion of lung alveolar epithelial cells is approximately 25% in LUAD and in normal adjacent tissue (LUAD and LUSC) but less than 5% in LUSC [47]. This result explains why the methylation level of cg00794055 (TBC1D24) was higher in LUSC than in LUAD.
Our work identified some validated cancer biomarkers. cg16104915, a CpG site located in the promoter CpG island of HOXA9, is a well-characterized biomarker in our 200-CpG set. It is methylated in 97% of NSCLC TCGA samples but not in normal tissue [50]. HOXA9 methylation is also a validated biomarker for cutaneous melanoma progression, with high methylation in metastases but low methylation in primary melanoma and nevi [20]. Our CpG set also included three known biomarker genes for colorectal cancer (LIFR, OSMR, QKI). The methylation levels of 10 CpGs in the QKI promoter were significantly higher in CRC than in normal tissues and other cancer types [16]. cg24583770 is adjacent to these 10 CpGs, and its hypermethylation status also distinguished colon cancer from normal tissues and other cancer types. Methylation of a segment of the OSMR promoter CGI (from -282 to -224) was found in 90% of colon cancer, 55% of normal-appearing mucosa adjacent to colon cancer, 33% of gastric cancer, and 20% of pancreatic cancer [51]. cg17528648 is in the 5'-UTR region of this OSMR CGI, and its hypermethylation distinguished colon cancer from adjacent normal mucosa and other cancer types. Hypermethylation of a CpG island located in the promoter of HOXD8 (chr2:176,993,479-176,995,557) is a validated biomarker of biliary tract cancer [52] (Additional file 2).
Through inspection of our 200-CpG set, we found some potential biomarkers for cancer type diagnosis. For instance, four CpGs within the same CpG island of TMEM101, a potential biomarker for reduced overall survival in breast cancer patients [53], were hypermethylated in UCEC but not in other cancer types. Similarly, cg25927164 (RAI1) was hypermethylated in BLCA only. Further studies are required to determine whether hypermethylation of TMEM101 and RAI1 could be used as biomarkers for the screening and diagnosis of endometrial and bladder cancer, respectively (Additional file 3).

Conclusions
In summary, we developed a DNA methylation-based CUP classifier (MFCUP) with machine learning algorithms. To make DNA methylation-based diagnosis accessible and affordable, we established and validated a targeted methylation sequencing panel, which demonstrated high accuracy in identifying the primary sites for CUP. Lastly, our work revealed some CpGs with biomarker potential for cancer type classification.

Fig. 5 Cancer type classification accuracy of the expanded TCGA/GSE validation cohort (n = 5923). A Sample number and prediction accuracy (%) of each cancer type. B Sensitivity, specificity, PPV, and NPV for each of the 25 cancer types. C Confusion matrix (in percent) of the expanded TCGA/GSE validation cohort of cancer type prediction using 200 selected probes. The percentages of correctly predicted samples are highlighted in green; misclassification events are highlighted in pink.

Fig. 6 Cancer type classification accuracy of the Infinium 450 K array testing datasets. A Sample number and prediction accuracy (%) of nine cancer types. B Confusion matrix (in percent) of the cancer type prediction using 200 selected probes for testing datasets generated by the Infinium 450 K methylation array. The percentages of correctly predicted samples are highlighted in green; misclassification events are highlighted in pink. True histology/predicted histology is respectively listed in rows/columns. BLCA Bladder urothelial carcinoma, BRCA Breast invasive carcinoma, CESC Cervical squamous cell carcinoma and endocervical adenocarcinoma, CRC Colorectal cancer, HN/ESCC Head and neck squamous carcinoma and esophageal squamous cell carcinoma, LIHC Liver hepatocellular carcinoma, LUAD Lung adenocarcinoma, LUSC Lung squamous cell carcinoma, PAAD Pancreatic adenocarcinoma, PRAD Prostate adenocarcinoma, SARC Sarcoma, TGCT Testicular germ cell tumors, THYM Thymoma, UC Uterine cancer, Upper GI Upper gastrointestinal adenocarcinoma

Fig. 7 Cancer type classification accuracy of the Infinium 850 K array testing datasets. A Sample number and prediction accuracy (%) of 15 cancer types. B Confusion matrix (in percent) of the cancer type prediction using 200 selected probes for testing datasets generated by the Infinium 850 K methylation array. The percentages of correctly predicted samples are highlighted in green; misclassification events are highlighted in pink. True histology/predicted histology is respectively listed in rows/columns. ACC Adrenocortical carcinoma, BLCA Bladder urothelial carcinoma, CESC Cervical squamous cell carcinoma and endocervical adenocarcinoma, CRC Colorectal cancer, HLM Hematolymphoid malignancies, HN/ESCC Head and neck squamous carcinoma and esophageal squamous cell carcinoma, LIHC Liver hepatocellular carcinoma, LUAD Lung adenocarcinoma, LUSC Lung squamous cell carcinoma, MESO Mesothelioma, PAAD Pancreatic adenocarcinoma, PCPG Pheochromocytoma and paraganglioma, PRAD Prostate adenocarcinoma, SARC Sarcoma, SKCM Skin cutaneous melanoma, TGCT Testicular germ cell tumors, THCA Thyroid carcinoma, UC Uterine cancer, Upper GI Upper gastrointestinal adenocarcinoma, UVM Uveal melanoma

Fig. 8 Confusion matrix (in percent) of the validation set of FFPE tumor tissues from 20 cancer types. Confusion matrix of the validation set (n = 78) of cancer type prediction using 200 selected probes. The percentages of correctly predicted samples are highlighted in green; misclassification events are highlighted in pink. True histology/predicted histology is respectively listed in rows/columns. ACC Adrenocortical carcinoma, BLCA Bladder urothelial carcinoma, BRCA Breast invasive carcinoma, CESC Cervical squamous cell carcinoma and endocervical adenocarcinoma, CRC Colorectal cancer, HLM Hematolymphoid malignancies, HN/ESCC Head and neck squamous carcinoma and esophageal squamous cell carcinoma, LIHC Liver hepatocellular carcinoma, LUAD Lung adenocarcinoma, LUSC Lung squamous cell carcinoma, MESO Mesothelioma, PAAD Pancreatic adenocarcinoma, PRAD Prostate adenocarcinoma, SKCM Skin cutaneous melanoma, TGCT Testicular germ cell tumors, THCA Thyroid carcinoma, UC Uterine cancer, Upper GI Upper gastrointestinal adenocarcinoma

Additional figure legend abbreviations: ACC Adrenocortical carcinoma, BLCA Bladder urothelial carcinoma, BRCA Breast invasive carcinoma, CESC Cervical squamous cell carcinoma and endocervical adenocarcinoma, COAD Colon adenocarcinoma, CRC Colorectal cancer, HLM Hematolymphoid malignancies, HN/ESCC Head and neck squamous carcinoma and esophageal squamous cell carcinoma, LAML Acute myeloid leukemia, LIHC Liver hepatocellular carcinoma, LUAD Lung adenocarcinoma, LUSC Lung squamous cell carcinoma, MESO Mesothelioma, PAAD Pancreatic adenocarcinoma, PCPG Pheochromocytoma and paraganglioma, PRAD Prostate adenocarcinoma, SARC Sarcoma, SKCM Skin cutaneous melanoma, STAD Stomach adenocarcinoma, TGCT Testicular germ cell tumors, THCA Thyroid carcinoma, THYM Thymoma, UC Uterine cancer, UCEC Uterine corpus endometrial carcinoma, Upper GI Upper gastrointestinal adenocarcinoma, UVM Uveal melanoma

EN-based classifier validation with non-TCGA methylation array datasets